Abstract

This  paper  discusses  experiments  involving  a  method  for  the  au- 
tomatic detection,  prior  to  the  integration  of  database  schemas,  of 
conflicts  in  the  naming  of  data  elements  within  these  schemas.  The 
method  relies  on  the  representation  of  semantic  information  (called 
quiddity)  about  the  data  elements  present  in  the  various  schemas.  We 
develop  several  inference  procedures  which,  utilizing  this  information, 
determine  whether  two  distinctly  named  elements  in  fact  represent 
the  same  object,  or  if  elements  with  the  same  name  actually  represent 
different  objects.  The  experiments  are  concerned  with  a)  examining 
the  accuracy  and  consistency  with  which  quiddities  of  data  elements 
might  be  declared  by  different  database  designers,  and  b)  evaluating 
the  accuracy  and  errors  of  these  automated  procedures.  Our  results 
indicate  that  the  method  has  promise  for  use  in  detection  of  naming 
conflicts,  and  that  certain  inference  procedures  are  superior  to  others 
in  terms  of  their  accuracy  and  error  rates. 


*We would like to express our thanks to Jim Connor, Mark Greer, Rafael Gacel, Nancy
Reinhart,  Jane  Smith,  and  Barbara  Treharne  for  making  the  experiments  a  success. 


1      Introduction 

Successful  integration  of  multiple  database  schemas  with  overlapping  do- 
mains requires  the  identification  and  resolution  of  conflicts  in  the  naming  of 
data  elements  within  the  schemas  of  these  databases.  This  paper  describes  a 
method  for  the  automatic  detection  of  such  naming  conflicts,  and  presents  the 
results  of  a  first  set  of  experiments  involving  the  application  of  this  method. 
The  method  being  tested  builds  upon  a  method  for  detecting  similar  con- 
flicts in  the  integration  of  multiple  mathematical  models  (see  [4]).  It  relies 
on the representation of certain semantic information (called quiddity^1), not
captured  in  data  dictionaries,  about  the  data  elements  or  attributes  present 
in  the  schemas  being  integrated.  The  first  part  of  our  experiments  is  con- 
cerned with  examining  the  accuracy  and  consistency  with  which  quiddities 
of  data  elements  might  be  declared  by  different  database  designers.  The  sec- 
ond part  of  our  experiments  is  concerned  with  an  analysis  and  comparison 
of  the  accuracy  and  errors  of  a  set  of  alternative  procedures  which  we  have 
developed.  These  inference  procedures  utilize  the  quiddity  information  to 
automatically  determine  whether  two  distinctly  named  elements  in  fact  rep- 
resent the  same  object  (the  synonym  problem),  or  if  data  elements  with  the 
same  name  actually  represent  different  objects  (the  homonym  problem). 

It  is  recognized  in  the  database  literature  that  naming  problems  must 
be  detected  and  resolved  prior  to  schema  integration  [6,  16],  and  that  the 
detection  of  these  problems  is  extremely  tedious,  time-consuming,  and  error- 
prone  [2,  11,  15].  Further,  several  methodologies  and  guidelines  have  been 
proposed  for  the  identification  (and  even  prevention)  of  such  conflicts  [7,  12]. 


^1 From the Oxford English Dictionary, quiddity is "The real nature or essence of a thing;
that  which  makes  a  thing  what  it  is."   [4] 


The  process  is  facilitated  with  automated  tools,  largely  by  providing  quick, 
on-line  access  to  information  about  these  elements.  However,  a  significant 
part  of  the  effort,  that  of  verifying  the  existence  of  a  conflict  for  each  pair  of 
data  elements,  must  still  be  borne  by  the  database  designer.  There  is  also 
not,  to  our  knowledge,  much  empirical  evidence  regarding  the  usefulness  of 
these  methodologies.  Other  automated  tools  are  proposed  to  support  the 
resolution  of  conflicts  (e.g.,  see  [5,  7,  10,  12]),  but  we  will  not  have  anything 
further  to  say  about  this  issue,  since  our  focus  is  on  the  detection  of  these 
conflicts.  We  will  also  not  be  concerned  here  with  other  kinds  of  conflicts 
(e.g.,  structural  conflicts — see  [11,  16,  17])  that  must  also  be  resolved  prior 
to  database  integration. 

More  recently,  the  problem  of  naming  conflicts  has  been  addressed  in  the 
model  management  literature  in  the  context  of  conflicts  in  the  naming  of 
modeling variables. The violation of the unique names assumption^2 in the
naming  of  variables  in  multiple  models  causes  problems  when  these  models 
are  integrated.  Bhargava  et  al.  [4]  argued  that  the  detection  of  such  naming 
problems  requires  knowledge  about  what  these  variables  represent,  and  that 
such  knowledge  must  be  represented  formally  if  the  detection  of  naming  con- 
flicts is  to  be  automated.  They  proposed  that  modeling  variables  be  further 
defined  in  terms  of  their  quiddities,  dimensions,  and  units  of  measurement, 
and  illustrated  with  several  examples  how  this  information  would  be  useful 
in  detecting  naming  conflicts.  They  discussed  a  formal  representation  for 
quiddities,  and  suggested  that  two  variables  (named  distinctly)  be  consid- 
ered as  candidates  for  a  possible  violation  of  the  unique  names  assumption  if 


^2 It is often useful and convenient to assume in software systems that every individual
has  at  most  one  name,  unless  stated  otherwise.  This  assumption  is  called  the  unique 
names  assumption  [9]. 


they  had  the  same  quiddity  and  dimension.  We  present  a  summary  of  their 
approach  in  §2.1. 

This  approach  raises  several  questions  regarding  the  use  of  quiddities  for 
the  detection  of  unique  names  violations.  For  example,  can  people  define 
quiddities  correctly?  Do  quiddities  capture  sufficient  information  to  ensure 
detection  of  these  violations?  What  procedures  must  be  designed  to  make 
this  detection,  and  how  accurate  will  they  be?  Early  in  our  research,  we 
conducted  a  preliminary  experiment,  involving  a  group  of  six  subjects,  in 
which we examined the clarity and feasibility of the concept of quiddities.
We  discuss  this  experiment  and  its  results  in  §2.2.  We  used  these  results,  as 
well  as  a  careful  analysis  of  the  idea  of  quiddities  as  it  applies  to  the  database 
world,  to  substantially  refine  the  concept,  and  to  develop  a  set  of  guidelines 
that  would  assist  a  database  designer  in  correctly  defining  quiddities  of  data 
elements  (see  §2.3).  We  then  developed  several  alternative  inference  proce- 
dures that  utilize  quiddity  information  in  the  detection  of  naming  conflicts. 
These  procedures,  and  the  rationale  for  each  of  them,  are  discussed  in  §3. 
Finally,  we  conducted  an  experiment  to  gather  information  about  how  suc- 
cessful this  approach  might  be  in  detecting  naming  conflicts.  Specifically, 
we  were  concerned  with  two  questions.  First,  was  our  concept  of  quiddities, 
and  our  guidelines  for  developing  them,  clear  and  precise  enough  so  that 
two  different  individuals  would  develop  equivalent  quiddities  for  the  same 
element?  Second,  could  this  information  be  gainfully  employed  to  automate 
the  detection  of  naming  conflicts,  and  if  so,  what  would  be  the  accuracy  and 
error  rates  of  the  various  inference  procedures?  We  examine  these  questions 
and  our  experiment  in  §4. 


2      Capturing  Semantic  Information  about  Data 
Elements  using  Quiddities 

In  this  section,  we  explain  the  concept  of  quiddities,  as  proposed  by  Bhargava 
et al., and as refined by us, and present certain guidelines that we believe
will  facilitate  the  correct  declaration  of  quiddities. 

2.1      Model  Integration:  Unique  Names  Violations  and 
Quiddities 

The violation of the unique names (of modeling variables) assumption causes
a  problem  in  model  integration  similar  to  the  one  caused  by  naming  conflicts 
in  database  integration.  The  homonym  and  synonym  problems  are  both  spe- 
cial cases  of  unique  names  violations  (UNVs).  After  an  analysis  of  informa- 
tion requirements  for  detecting  UNVs,  Bhargava  et  al.  concluded  that  such 
detection  required  descriptive  information  about  the  variables,  and  that  this 
information  be  represented  using  a  descriptive  apparatus  that  was  sufficiently 
rich  and  unambiguous.  They  suggested  that  the  quiddity  and  dimension  of 
variables  be  represented  formally,  and  argued  that  if  two  distinctly  named 
variables  had  equivalent  dimensions  and  quiddities,  then  those  variables  pos- 
sibly constituted  a  UNV. 

The  quiddity  of  a  modeling  variable  (or  data  element)  provides  a  descrip- 
tion of  "what  it  is  the  variable  is  about",  and  in  particular,  information  that 
is  relevant  to  UNV  detection  [4].  Bhargava  et  al.  proposed  a  formal  language 
for  representing  the  quiddity  of  a  variable.  In  this  language,  quiddity  is  de- 
fined in  terms  of  six  categories  of  information  about  the  variable:  stuff,  types 
of stuff, attributes of stuff, types of attributes of stuff, and metafunctions.
They  showed,  using  several  examples,  that  these  six  categories  were  required 


to  be  able  to  distinguish  the  quiddities  of  variables  that  appeared  the  same  but 
were  really  not  (and  so  did  not  pose  a  UNV  problem).  The  quiddity  expres- 
sion for  a  variable  is  developed  by  specifying  terms,  from  a  given  vocabulary, 
for  (some,  or  all,  of)  these  categories,  and  combining  these  terms  according 
to  the  grammar  for  the  language.  These  components,  and  the  quiddity,  are 
illustrated  using  the  following  two  examples.  (For  our  purpose,  the  rules  of 
formation  for  representing  quiddity  expressions  in  the  formal  language  are 
not relevant, and will not be discussed here.)

Example 1    tail-number

Description: Tail number of a fighter aircraft.
Quiddity: tail-number (fighter (aircraft))

Example 2    command-com

Description: Indicates whether or not damage is caused by a virus to an
operating system.

Quiddity: indicator (damage (virus, system))

The  component  stuff  answers  the  question  "What  is  the  object  this  vari- 
able is  about?"  Stuff  is  usually,  but  not  necessarily,  indicated  by  a  noun, 
describing  individual  things  or  collections  of  individual  things,  such  as  cars, 
trucks,  or  ships.  In  examples  1  and  2  above,  the  stuff  terms  are  aircraft  and 
damage,  respectively.  A  stuff  term  may  have  an  associated  arity  if  one  or 
more  arguments  are  required  to  fully  define  it.  In  example  2  above,  we  are 
interested in damage by something (virus) to something (system). Therefore,
damage  has  arity  2,  with  the  arguments  virus  and  system.  The  component 
stuff  type  answers  the  question  "What  sort  of  or  kind  of  stuff  is  it?"  Stuff 
types further describe stuff. In example 1, the stuff type term fighter qualifies
the  stuff  expression  aircraft. 


The  component  stuff  attribute  answers  the  question  "What  is  it  about  the 
stuff  that  you  are  interested  in?"  In  the  above  examples,  we  are  interested 
in the tail-number of the aircraft, and whether or not (indicator) there is
damage  by  the  virus  to  the  system.  The  component  stuff  attribute  type 
answers  the  question  "What  sort  of  or  kind  of  stuff  attribute  is  it?"  Stuff 
attribute types further qualify stuff attribute. Neither example 1 nor example 2 has
a  stuff  attribute  type,  but,  for  instance,  an  attribute  cost  might  be  qualified 
as  a  purchase  cost  or  a  production  cost. 

Metafunctions  capture  information,  usually  statistical  or  mathematical, 
about  the  attribute  of  the  data  element.  Examples  of  metafunctions  are 
average, maximum, minimum, sum, and variance. Neither example 1 nor example 2
has a metafunction, but, for instance, an attribute cost might be an average
cost or a minimum cost.
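
The components described above can be collected into a simple record. The following Python sketch is illustrative only and is not the formal language of [4]: the class name Quiddity, its field names, and the render method are assumptions made for illustration. The render method reproduces only the nesting visible in Examples 1 and 2; attribute types and metafunctions are stored but not rendered, since the rules of formation are not given here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Quiddity:
    """Hypothetical container for the quiddity components of one element."""
    stuff: str                              # what the element is about
    attribute: str                          # the aspect of the stuff of interest
    arguments: Tuple[str, ...] = ()         # arguments of the stuff term
    stuff_types: Tuple[str, ...] = ()       # qualifiers of the stuff
    attribute_types: Tuple[str, ...] = ()   # qualifiers of the attribute
    metafunction: Optional[str] = None      # e.g. "average" or "minimum"

    def render(self) -> str:
        """Render stuff, arguments, stuff types and attribute as nested text."""
        inner = self.stuff
        if self.arguments:
            inner = f"{inner} ({', '.join(self.arguments)})"
        for stuff_type in self.stuff_types:
            inner = f"{stuff_type} ({inner})"
        return f"{self.attribute} ({inner})"


# Example 1: tail number of a fighter aircraft.
EXAMPLE_1 = Quiddity(stuff="aircraft", attribute="tail-number",
                     stuff_types=("fighter",))

# Example 2: indicator of damage by a virus to a system (stuff of arity 2).
EXAMPLE_2 = Quiddity(stuff="damage", attribute="indicator",
                     arguments=("virus", "system"))
```

Under these assumptions, EXAMPLE_1.render() and EXAMPLE_2.render() give the quiddity strings listed in Examples 1 and 2.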

Bhargava  et  al.  recognized  that  the  quiddity  expressions  in  their  language 
only  approximated  the  actual  quiddities  of  variables.  They  argued  that,  in 
spite of this approximation, the concept was sufficiently rich and expressive to
be  of  use  in  UNV  detection.  Assuming  that  is  true,  the  use  of  quiddities  for 
UNV  detection  raises  several  questions.  Is  it  possible  that  different  people 
will  specify  a  different  quiddity  for  the  same  variable  (even  given  the  same 
information  about  it)?  Are  the  quiddity  categories  general  enough  to  capture 
relevant  information  in  typical  database  applications?  Are  the  various  quid- 
dity categories  clear  and  meaningful?  If  not,  which  of  these  are  not  clearly 
understood?  While  our  interest  went  beyond  these  issues,  we  conducted  a 
small  experiment  to  gain  an  understanding  of  the  answers  to  these  questions. 


2.2      Quiddities?   A  Preliminary  Experiment 

We  conducted  a  preliminary  experiment  to  examine  the  above-mentioned 
issues  in  quiddity  acquisition.  Two  databases,  overlapping  in  their  real  world 
domains,  were  used  as  the  basis  for  this  experiment.  These  databases  were 
developed  by  separate  teams.  Twelve  data  elements  from  each  database  were 
selected  for  quiddity  formulation.  We  ensured  that  unique  name  violations 
did  exist  among  the  selected  subsets  of  these  two  databases.  Each  subject 
was  given  a  packet  which  contained  the  following:  an  overall  information 
sheet,  a  basic  instruction  sheet,  a  work  sheet  (for  practice  and  instructional 
purposes  prior  to  beginning  the  experiment),  a  general  (i.e.,  the  terms  were 
not  separated  by  quiddity  category)  vocabulary  list,  a  list  of  standard  data 
dictionary  entries  for  the  selected  data  elements,  and  sample  output  reports 
from  the  databases  displaying  data  values  for  the  selected  elements.  All 
subjects  were  provided  with  instruction  on  the  concept,  representation,  and 
rules  of  formation  of,  quiddities.  Sample  quiddity  problems  were  discussed 
with  the  subjects  prior  to  beginning  the  experiment.  (See  [3]  for  details.) 

Six  subjects,  three  for  each  database,  participated  in  the  experiment. 
Each  subject  was  asked  to  independently  formulate  quiddities  for  the  ele- 
ments in  the  database  assigned  to  the  subject.  Thus  for  each  of  the  two 
databases,  we  had  a  group  of  three  subjects  formulating  quiddities  for  the 
same  data  elements  using  the  same  set  of  information  about  these  elements. 
The  subjects  were  advised,  though  not  restricted,  to  use  the  vocabulary  pro- 
vided in  the  vocabulary  list. 

We  performed  the  following  across-subject  analyses  within  each  group, 
using  the  quiddities  formulated  by  the  subjects.  First,  within  each  group, 
we  compared  the  quiddities  developed  by  the  subjects  with  the  "correct" 


quiddity  (determined  prior  to  the  experiment).  We  found  that  very  few 
quiddities  were  correctly  defined — there  were  no  matches  for  group  1  and 
7  matches  for  group  2,  out  of  a  maximum  of  36  possible  matches  in  each 
group.  (Two  quiddity  expressions  matched  only  when  they  agreed  on  every 
quiddity  component.)  Second,  for  each  pair  of  subjects  in  the  same  group, 
we  compared  the  quiddities  developed  by  those  subjects.  Again,  we  found 
that  very  few  quiddities  were  identically  defined — there  were  only  2  matches 
for  group  1  and  3  matches  for  group  2,  out  of  a  maximum  of  36  possible 
matches  in  each  group. 
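
The two counts described above can be computed mechanically. The Python sketch below assumes that each subject's quiddities are stored as a mapping from element name to a quiddity string, so that a strict match is plain string equality. The function names and the toy data are hypothetical and are not the experimental data. With 12 elements and 3 subjects per group there are 12 x 3 = 36 comparisons against the correct quiddities, and, with three pairs of subjects per element, 12 x 3 = 36 pairwise comparisons.

```python
from itertools import combinations


def reference_matches(answers, reference):
    """Count quiddities that agree exactly with the reference quiddity."""
    return sum(
        quiddity == reference[element]
        for by_element in answers.values()
        for element, quiddity in by_element.items()
    )


def pairwise_matches(answers):
    """Count exact agreements per element over every pair of subjects."""
    return sum(
        answers[s][element] == answers[t][element]
        for s, t in combinations(sorted(answers), 2)
        for element in answers[s]
    )


# Toy data: two elements, three subjects (invented for illustration).
reference = {
    "tail_no": "tail-number (fighter (aircraft))",
    "cmd": "indicator (damage (virus, system))",
}
answers = {
    "s1": {"tail_no": "tail-number (fighter (aircraft))",
           "cmd": "damage (virus)"},
    "s2": {"tail_no": "tail-number (aircraft)",
           "cmd": "damage (virus)"},
    "s3": {"tail_no": "tail-number (aircraft)",
           "cmd": "indicator (damage (virus, system))"},
}

# Maximum possible matches for 12 elements and 3 subjects per group.
elements, subjects = 12, 3
pairs_of_subjects = len(list(combinations(range(subjects), 2)))
max_reference = elements * subjects
max_pairwise = elements * pairs_of_subjects
```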

Comparisons  of  individual  quiddity  components  showed  that  the  subjects 
were  often  not  able  to  correctly  identify  the  stuff  and  stuff  attribute  (the 
performance  on  the  other  categories  was  even  poorer).  Compared  with  the 
correct  quiddities,  there  were  7  stuff  matches  in  group  1,  24  stuff  matches 
in  group  2,  24  attribute  matches  in  group  1,  and  14  attribute  matches  in 
group  2,  all  out  of  a  maximum  of  36.  Compared  within  the  groups,  the 
numbers  were  13,  22,  16,  and  13,  respectively,  again  out  of  a  maximum  of 
36.  There  were  several  cases  where  the  subjects  interchanged  the  stuff  with 
the  stuff  attribute. 

What  do  we  learn  from  these  results?  We  find,  a)  even  though  this  was 
a  small  experiment,  b)  the  subjects  had  only  a  quick  introduction  to  the 
concept  of  quiddities,  and  c)  we  defined  a  "match"  very  strictly,  that  there 
was  much  confusion  in  applying  the  definition  and  concept  of  quiddity  and 
its  components.  The  definitions  and  meanings  of  the  various  categories  were 
not  sufficient  or  unambiguous  enough  for  the  subjects  to  develop  quiddities 
in a manner consistent with the actual concepts. There was a lack of clear
distinction  between  the  stuff  and  stuff  attribute  components,  the  two  most 
significant  quiddity  categories.  This  led  to  confusion  in  determining  the  arity 


of  the  stuff  component  and  in  identifying  the  sortal  information  provided  by 
the  stuff  type  and  the  stuff  attribute  type.  Further,  the  subjects  were  unclear 
about  the  level  of  detail  at  which  they  needed  to  define  the  quiddities. 

2.3      Guidelines  for  Developing  Quiddities 

We  used  the  results  of  our  preliminary  experiment  and  feedback  from  the 
subjects  to  re-analyze  the  concept  of  quiddities.  We  found  that  it  was  still 
useful  to  represent  quiddity  in  terms  of  the  six  categories  discussed  earlier. 
However,  we  refined  our  interpretations  of  some  of  these  categories.  We 
concluded  that  each  data  element  must  have  exactly  one  stuff  term  and 
exactly  one  stuff  attribute  term.  The  stuff  attribute  is  a  measurable  aspect 
(the  thing  being  measured)  of  the  stuff,  and  is  best  indicated  by  examining 
some  of  the  data  values  corresponding  to  the  data  element.  The  stuff  is  then 
simply  the  thing  that  this  measurement  is  about.  It  is  particularly  useful  to 
include  an  attribute  called  indicator — this  is  useful  for  data  elements  with 
Boolean  or  similar  values,  which  indicate  the  status  of  some  property  (the 
stuff)  of  the  data  element.  We  have  also  suggested  several  changes  in  the 
quiddity  acquisition  process  based  on  our  analysis  of  the  information  being 
captured  in  the  quiddity  components.  For  lack  of  space  (see  [3]  for  details), 
we  will  only  summarize  the  results  and  guidelines  for  quiddity  formulation 
that  were  derived  from  this  analysis.  These  guidelines  are  listed  below. 

1.  Gather  Information:  Examine  the  definition  of  the  data  element  us- 
ing information,  such  as  that  in  the  data  dictionary,  about  the  data 
element. 

2.  Examine Data: Examine a collection of actual data values, and their
units of measurement (if any), for the data element. Answer the questions
"What does this data represent?" and "What are these values a measure
of?" For example, "John" and "Mary" are names, $21 and $40
represent costs, and {0, 1} values are indicators of something.

3.  Identify  Attribute:  Identify  the  stuff  attribute  by  examining  the  data 
values  and  data  definition.  The  attribute  is  usually  a  noun,  and  is  a 
measurable  (in  the  abstract  sense)  item. 

4.  Identify  Stuff:  Now  identify  the  stuff  term  by  looking  at  the  attribute 
term and asking the question "This is an attribute of what?" The
stuff  is  also  generally  a  noun  and  is  the  object  of  a  prepositional  phrase 
associated  with  the  stuff  attribute.  For  example,  if  the  attribute  is  cost, 
the  question  "Cost  of  what?"  will  lead  to  the  stuff  term. 

5.  Identify Remaining Components: Answer the questions "What sort of
stuff is it?" (the stuff type), "What sort of stuff attribute is it?" (the
stuff attribute type), and "Is the stuff term a function of something
else?" (stuff arguments).

6.  Verify  Terms:  Ensure  that  the  terms  are  present  in  the  appropriate  cat- 
egory in  the  vocabulary,  and  have  the  same  interpretation  as  intended. 
If  not,  select  a  suitable  term  from  the  vocabulary. 

3      Procedures  for  Determining  Quiddity  Equiv- 
alence 

In  this  section,  we  present  automated  procedures  for  the  detection  of  possi- 
ble naming  conflicts  in  schemas  of  different  databases.  It  is  useful,  for  this 
purpose, to view the stuff, stuff-arguments, and stuff-type terms collectively
as the stuff-part, and to view the stuff-attribute and stuff-attribute-type terms
as  the  attribute-part.  Then,  we  define  two  quiddities  to  be  equivalent  if  and 
only  if  they  have  equivalent  stuff-parts  and  equivalent  attribute-parts.  We  use 
the  symbol  =  for  equivalence. 

We  need  to  operationalize  this  definition  of  quiddity  equivalence  by  defin- 
ing rules  for  stuff-part  and  attribute-part  equivalence.  We  also  need  rules  for 
establishing  whether  or  not  two  terms  are  equivalent.  The  alternative  quid- 
dity equivalence  procedures  we  propose  here,  and  in  particular,  our  rules  for 
term  equivalence,  stuff-part  equivalence,  and  attribute-part  equivalence,  are 
motivated  by  certain  of  our  hypotheses  regarding  how  different  people  may 
interpret  and  specify  quiddities.  We  begin  by  stating  these  hypotheses,  and 
follow  that  by  specifying  the  alternative  rules  and  procedures. 

1.  Stuff and stuff-attribute are the most significant quiddity components.

2.  Different  people  are  likely  to  choose  terms  of  different  specificity  in 
defining  the  same  quiddity.  For  example,  one  person  might  use  the 
term  vehicle  for  the  same  component  for  which  another  person  chose 
the  more  specific  term  truck. 

3.  There is scope for confusion between the stuff-type and the stuff-arguments
components. 

4.  Some  people  are  likely  to  define  quiddities  more  extensively  than  others. 
For  example,  one  might  use  the  stuff  type  terms  fighter  and  unmanned 
to  qualify  the  stuff  term  aircraft  of  example  1. 

3.1      Term  Equivalence

When  is  one  term  equivalent  to  another?  Clearly,  they  are  equivalent  when 
they  are  exactly  the  same.  They  could  also  be  considered  equivalent  if  one 
is  the  synonym  of  the  other.  Finally,  based  on  hypothesis  2,  they  could  be 
considered  equivalent  if  they  are  in  the  same  "class,"  and  one  term  is  more 
(or,  less)  specific  than  the  other.  To  operationalize  the  last  two  cases,  we 
will  assume  that  there  are  two  relationships  defined  between  terms  in  the 
vocabulary. First, the binary relation synonyms, such that synonyms($\alpha$, $\beta$)
is true when $\alpha$ and $\beta$ are synonyms. This relation is transitive as well as
commutative. Second, the binary relation is-a, such that is-a($\alpha$, $\beta$) is true
when $\alpha$ is a specialization of $\beta$. This relation is transitive. We write
$\alpha =_{T_i} \beta$ to mean that $\alpha$ is equivalent to $\beta$ using term equivalence
rule $T_i$. Then we define the following three alternate rules for term equivalence.

1.  $\alpha =_{T_1} \beta$ if $\alpha$ and $\beta$ are syntactically the same.

2.  $\alpha =_{T_2} \beta$ if $\alpha =_{T_1} \beta$ or synonyms($\alpha$, $\beta$).

3.  $\alpha =_{T_3} \beta$ if $\alpha =_{T_2} \beta$, or is-a($\alpha$, $\beta$), or is-a($\beta$, $\alpha$).

We note that a set of terms is equivalent to another set of terms if there
is some permutation of the terms in one set, such that the $i$-th term of that
set is equivalent to the $i$-th term of the other ($i$ ranging from 1 to the number
of terms in the set).
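
The three term equivalence rules and the permutation-based comparison of term sets can be sketched in Python as follows. This is a minimal illustration and not the Prolog implementation mentioned in §3.4. The vocabulary relations SYNONYMS and IS_A and the function names are hypothetical, and rule 3 is read literally: the is-a relation is applied to the two terms themselves, not chained through synonyms.

```python
from itertools import permutations

# Hypothetical vocabulary relations, invented for illustration.
SYNONYMS = {("car", "automobile")}
IS_A = {("truck", "vehicle"), ("pickup", "truck")}  # (specific, general)


def _reachable(edges, start, goal):
    """True if goal is reached from start by following one or more edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for src, dst in edges:
            if src == node and dst not in seen:
                if dst == goal:
                    return True
                seen.add(dst)
                stack.append(dst)
    return False


def synonyms(a, b):
    """Symmetric and transitive closure of the SYNONYMS pairs."""
    both_ways = SYNONYMS | {(y, x) for x, y in SYNONYMS}
    return _reachable(both_ways, a, b)


def is_a(a, b):
    """Transitive closure of IS_A: a is a specialization of b."""
    return _reachable(IS_A, a, b)


def term_equiv(a, b, rule):
    """Term equivalence under rule T1 (rule=1), T2 (rule=2) or T3 (rule=3)."""
    if a == b:                                        # T1: same term
        return True
    if rule >= 2 and synonyms(a, b):                  # T2 adds synonyms
        return True
    return rule >= 3 and (is_a(a, b) or is_a(b, a))   # T3 adds is-a


def set_equiv(xs, ys, rule):
    """Some permutation of xs matches ys term by term."""
    if len(xs) != len(ys):
        return False
    return any(
        all(term_equiv(x, y, rule) for x, y in zip(perm, ys))
        for perm in permutations(xs)
    )
```

For instance, under these toy relations truck and vehicle are equivalent only under rule 3, which corresponds to hypothesis 2 (terms of different specificity).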

3.2      Stuff-Part  Equivalence

When  is  the  stuff-part  of  one  quiddity  equivalent  to  the  stuff-part  of  another? 
In  the  simplest  and  strictest  case,  when  the  stuff  term,  argument  terms,  and 
stuff-type terms, in one are equivalent to the stuff term, argument terms,
and  stuff-type  terms,  respectively,  in  the  other.  (Note  that  any  of  the  three 
rules  for  term  equivalence  could  be  used  in  this  rule,  and  in  the  remaining 
stuff-part  equivalence  rules.)  Second,  (motivated  by  hypothesis  4),  when  the 
argument terms of one are a subset^3 of the argument terms of the other, the
stuff-type  terms  of  one  are  a  subset  of  the  stuff-type  terms  of  the  other,  and 
the  stuff  terms  are  equivalent.  Third,  (motivated  by  hypotheses  3  and  4), 
when  the  argument  and  stuff-type  terms  (collectively)  of  one  are  a  subset 
of  the  argument  and  stuff-type  terms  of  the  other,  and  the  stuff  terms  are 
equivalent.  Note  that,  motivated  by  hypothesis  1,  the  stuff  terms  must  be 
equivalent  in  each  of  these  rules.  Fourth,  and  least  strictly,  when  the  entire 
stuff-part  of  one  is  a  subset  of  the  stuff-part  of  the  other. 

We write $\alpha =_{S_j} \beta$ to mean that the stuff-part $\alpha$ is equivalent to stuff-part
$\beta$ using stuff-part equivalence rule $S_j$ in conjunction with term equivalence
rule $T_i$. Then we define the following four alternate rules for stuff-part
equivalence.

1.  $\alpha =_{S_1} \beta$ if

(a)  stuff($\alpha$) $=_{T_i}$ stuff($\beta$),

(b)  arguments($\alpha$) $=_{T_i}$ arguments($\beta$), and

(c)  stuff-type($\alpha$) $=_{T_i}$ stuff-type($\beta$).

2.  $\alpha =_{S_2} \beta$ if

(a)  stuff($\alpha$) $=_{T_i}$ stuff($\beta$),

(b)  arguments($\alpha$) $\sim_{T_i}$ arguments($\beta$), and

(c)  stuff-type($\alpha$) $\sim_{T_i}$ stuff-type($\beta$).

3.  $\alpha =_{S_3} \beta$ if

(a)  stuff($\alpha$) $=_{T_i}$ stuff($\beta$), and

(b)  (arguments, stuff-type)($\alpha$) $\sim_{T_i}$ (arguments, stuff-type)($\beta$).

4.  $\alpha =_{S_4} \beta$ if

(a)  (stuff, arguments, stuff-type)($\alpha$) $\sim_{T_i}$ (stuff, arguments, stuff-type)($\beta$).


^3 Since the direction of the subset relationship between two sets of terms is irrelevant
in our rules, we will use the symbol $\sim$ to mean that one is a subset of the other.
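
The four stuff-part rules can be sketched in Python as follows. This is a hedged illustration, not the authors' Prolog code. The names StuffPart, subset_either and stuff_part_equiv are hypothetical, the term comparison defaults to exact match (rule T1) but any term equivalence function can be passed in, and in rule 2 the subset direction is allowed to differ between arguments and stuff types, which is one reading of the statement that the direction is irrelevant. The example terms are invented.

```python
from itertools import permutations
from typing import NamedTuple, Tuple


class StuffPart(NamedTuple):
    stuff: str
    arguments: Tuple[str, ...] = ()
    stuff_types: Tuple[str, ...] = ()


def exact(a, b):
    """Term equivalence rule T1: the terms are the same."""
    return a == b


def subset_either(xs, ys, eq=exact):
    """True if the smaller term set maps one-to-one into the larger one."""
    small, large = sorted((list(xs), list(ys)), key=len)
    return any(
        all(eq(term, other) for term, other in zip(small, perm))
        for perm in permutations(large, len(small))
    )


def set_equiv(xs, ys, eq=exact):
    """Equal-sized term sets that match under some permutation."""
    return len(xs) == len(ys) and subset_either(xs, ys, eq)


def stuff_part_equiv(a, b, rule, eq=exact):
    """Stuff-part equivalence under rule S1..S4 (rule = 1..4)."""
    if rule == 4:  # whole stuff-part of one is a subset of the other
        return subset_either((a.stuff, *a.arguments, *a.stuff_types),
                             (b.stuff, *b.arguments, *b.stuff_types), eq)
    if rule not in (1, 2, 3):
        raise ValueError("rule must be 1, 2, 3 or 4")
    if not eq(a.stuff, b.stuff):  # rules S1-S3 require equivalent stuff
        return False
    if rule == 1:
        return (set_equiv(a.arguments, b.arguments, eq)
                and set_equiv(a.stuff_types, b.stuff_types, eq))
    if rule == 2:
        return (subset_either(a.arguments, b.arguments, eq)
                and subset_either(a.stuff_types, b.stuff_types, eq))
    # rule == 3: arguments and stuff types are pooled before comparison
    return subset_either(a.arguments + a.stuff_types,
                         b.arguments + b.stuff_types, eq)


# Hypothesis 4: one designer qualifies the stuff more extensively.
brief = StuffPart("aircraft", stuff_types=("fighter",))
detailed = StuffPart("aircraft", stuff_types=("fighter", "unmanned"))

# Hypothesis 3: stuff type and argument interchanged (invented terms).
as_argument = StuffPart("damage", ("virus",), ("permanent",))
as_type = StuffPart("damage", ("permanent",), ("virus",))
```

With these examples, brief and detailed are equivalent under S2 but not S1, and as_argument and as_type are equivalent under S3 but not S2.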

3.3      Attribute-Part  Equivalence 

When  is  the  attribute-part  of  one  quiddity  equivalent  to  the  attribute-part 
of  another?  In  the  simplest  and  strictest  case,  when  the  attribute  term 
and  attribute-type  terms  in  one  are  equivalent  to  the  attribute  term  and 
attribute-type  terms,  respectively,  in  the  other.  (Again,  any  of  the  three 
rules  for  term  equivalence  could  be  used  in  this  rule,  and  in  the  remaining 
attribute-part  equivalence  rules.)  Second,  (motivated  by  hypothesis  4),  when 
the  attribute-type  terms  of  one  are  a  subset  of  the  attribute-type  terms  of 
the  other,  and  the  attribute  terms  are  equivalent.  Note  that,  motivated  by 
hypothesis  1,  the  attribute  terms  must  be  equivalent  in  both  of  these  rules. 
Third,  and  least  strictly,  when  the  entire  attribute-part  of  one  is  a  subset  of 
the  attribute-part  of  the  other. 

We write $\alpha =_{A_k} \beta$ to mean that the attribute-part $\alpha$ is equivalent to
attribute-part $\beta$ using attribute-part equivalence rule $A_k$ in conjunction with
term equivalence rule $T_i$. Then we define the following three alternate rules
for attribute-part equivalence.

1.  $\alpha =_{A_1} \beta$ if

(a)  attribute($\alpha$) $=_{T_i}$ attribute($\beta$), and

(b)  attribute-type($\alpha$) $=_{T_i}$ attribute-type($\beta$).

2.  $\alpha =_{A_2} \beta$ if

(a)  attribute($\alpha$) $=_{T_i}$ attribute($\beta$), and

(b)  attribute-type($\alpha$) $\sim_{T_i}$ attribute-type($\beta$).

3.  $\alpha =_{A_3} \beta$ if

(a)  (attribute, attribute-type)($\alpha$) $\sim_{T_i}$ (attribute, attribute-type)($\beta$).
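
The attribute-part rules follow the same pattern and can be sketched in the same way. As before, this Python fragment is an illustration under stated assumptions: the names AttributePart, subset_either and attribute_part_equiv are hypothetical, the term comparison defaults to exact match (rule T1), and the cost examples extend the purchase cost illustration of §2.1 with invented data.

```python
from itertools import permutations
from typing import NamedTuple, Tuple


class AttributePart(NamedTuple):
    attribute: str
    attribute_types: Tuple[str, ...] = ()


def exact(a, b):
    """Term equivalence rule T1: the terms are the same."""
    return a == b


def subset_either(xs, ys, eq=exact):
    """True if the smaller term set maps one-to-one into the larger one."""
    small, large = sorted((list(xs), list(ys)), key=len)
    return any(
        all(eq(term, other) for term, other in zip(small, perm))
        for perm in permutations(large, len(small))
    )


def attribute_part_equiv(a, b, rule, eq=exact):
    """Attribute-part equivalence under rule A1..A3 (rule = 1..3)."""
    if rule == 3:  # whole attribute-part of one is a subset of the other
        return subset_either((a.attribute, *a.attribute_types),
                             (b.attribute, *b.attribute_types), eq)
    if rule not in (1, 2):
        raise ValueError("rule must be 1, 2 or 3")
    if not eq(a.attribute, b.attribute):  # A1 and A2 need equal attributes
        return False
    types_nested = subset_either(a.attribute_types, b.attribute_types, eq)
    same_size = len(a.attribute_types) == len(b.attribute_types)
    # A1 needs the type sets to match fully; A2 accepts a subset.
    return types_nested and (same_size or rule == 2)


plain_cost = AttributePart("cost")
purchase_cost = AttributePart("cost", ("purchase",))
swapped = AttributePart("purchase", ("cost",))  # attribute and type exchanged
```

Here plain_cost and purchase_cost are equivalent under A2 but not A1, while purchase_cost and swapped are equivalent only under A3.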

3.4      Inference  Procedures 

An inference procedure $P_{ijk}$ is simply a combination of rules $T_i$, $S_j$, and $A_k$
for determining term, stuff-part, and attribute-part equivalence, respectively.
We write $\phi =_{ijk} \psi$ to mean that the quiddity $\phi$ is equivalent to quiddity $\psi$
using procedure $P_{ijk}$. Based on the equivalence rules discussed above, there
are 36 (3 x 4 x 3) possible procedures. However, given the motivations behind
the equivalence rules, only 12 procedures, $P_{i11}$, $P_{i22}$, $P_{i32}$, and $P_{i43}$
($i = 1, 2, 3$), are meaningful. These equivalence rules and procedures were
implemented  in  the  Edinburgh  syntax  of  Prolog  [14]  and  tested  on  a  Macin- 
tosh implementation  of  Prolog. 
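
The count of procedures can be checked by enumeration. The short Python sketch below is illustrative; the variable and function names are assumptions. It lists all 36 combinations and then the 12 combinations in which the stuff-part and attribute-part rules are paired as (1, 1), (2, 2), (3, 2), and (4, 3).

```python
from itertools import product

TERM_RULES = (1, 2, 3)        # T1..T3
STUFF_RULES = (1, 2, 3, 4)    # S1..S4
ATTRIBUTE_RULES = (1, 2, 3)   # A1..A3

# Every combination (i, j, k) of term, stuff-part and attribute-part rule.
all_procedures = list(product(TERM_RULES, STUFF_RULES, ATTRIBUTE_RULES))

# Pairings (j, k) of stuff-part and attribute-part rules kept in the text.
MEANINGFUL_PAIRS = ((1, 1), (2, 2), (3, 2), (4, 3))
meaningful = [(i, j, k) for i in TERM_RULES for (j, k) in MEANINGFUL_PAIRS]


def procedure_name(procedure):
    """Format (i, j, k) as a label such as P122."""
    return "P" + "".join(str(index) for index in procedure)
```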

Each  inference  procedure  will  determine  whether  or  not  a  pair  of  variables 
constitutes  a  naming  conflict  (homonym  or  synonym  problem).    There  are 
two kinds of errors, called Type 1 and Type 2 errors, that a procedure can
commit. A type 1 error occurs when the procedure indicates a UNV problem
when in fact there is none. A type 2 error occurs when the procedure fails
to indicate a problem when in fact there is one. The second one is the more
important to avoid, since our objective is to detect UNVs. In general, let $w_1$
and $w_2$ denote the weights assigned to these two kinds of errors. (A higher
weight indicates that it is more costly to commit an error, and $w_2$ will usually
be much greater than $w_1$.) Suppose that for a given pair of databases, a
procedure commits $n_1$ errors of type 1 and $n_2$ errors of type 2. Then the
weighted error rate for that procedure is given by

$$E_w = n_1 w_1 + n_2 w_2 \qquad (1)$$

The ratio $w = w_2 / w_1$ is a measure of the relative weight of these two errors. It
will be convenient to set $w_1$ to 1 (so that $w = w_2$), and to vary $w_2$ depending
on the relative importance of avoiding type 2 errors. A procedure dominates
another procedure if it commits fewer errors of both types. However, in
general, if a procedure commits fewer errors of one type, it is likely to commit
more errors of the other type. In that case, the weighted error rate, with a
suitable choice of $w_2$, can be used to compare various procedures.
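
A small numerical illustration of equation (1) and of dominance follows, as a Python sketch. The error counts are invented for illustration and are not results from the experiment, and the function names are assumptions. With a strict procedure committing (n1, n2) = (1, 4) and a lenient one committing (6, 2), neither dominates the other, and the preferred procedure changes once w2 exceeds 2.5 (the value at which 1 + 4 w2 = 6 + 2 w2).

```python
def weighted_error(n1, n2, w1=1.0, w2=1.0):
    """Equation (1): n1 type 1 errors and n2 type 2 errors, weighted."""
    return n1 * w1 + n2 * w2


def dominates(p, q):
    """True if p commits fewer errors of both types than q.

    p and q are (n1, n2) pairs of error counts.
    """
    return p[0] < q[0] and p[1] < q[1]


# Hypothetical error counts (n1, n2) for two procedures.
strict = (1, 4)    # few false alarms, more missed conflicts
lenient = (6, 2)   # more false alarms, fewer missed conflicts

# Equal weights favor the strict procedure; a heavier w2 favors the lenient one.
equal_weights = (weighted_error(*strict), weighted_error(*lenient))
heavy_type_2 = (weighted_error(*strict, w2=5.0),
                weighted_error(*lenient, w2=5.0))
```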

4      Quiddity  Acquisition  and   Inference:    An 
Experiment 

In this section, we describe an experimental investigation of the usefulness
of quiddities in the detection of naming conflicts. We first describe the
experiment and its design, and then examine the results of this experiment in
terms of a) the correctness of specification of quiddities, and b) the accuracy
of the alternative inference procedures in the detection of naming conflicts.

4.1  Experiment  Design 

We conducted a second experiment to investigate the usefulness of a) our
guidelines for developing quiddities (§2.3), and b) our procedures for
determining quiddity equivalence (§3). The experiment involved two new
databases, also developed by different teams. This experiment was designed
and conducted in a manner similar to that of the preliminary experiment
(§2.2), except for the following variations. Each of the databases had 15
data elements. There were 5 synonym and 3 homonym problems in these schemas.
In this experiment, the vocabulary provided to the subjects was classified by
quiddity category, and the subjects were restricted to using only the terms
in the vocabulary.

4.2  Experiment  Results:  Quiddity  Acquisition 

The  experiment  results  were  again  divided  into  two  groups,  one  for  each 
database.  There  are  a  total  of  45  quiddities  developed  by  subjects  in  each 
group,  three  for  each  of  the  fifteen  data  elements. 

We  performed  the  same  across-subject  analyses  within  each  group  as  in 
the  preliminary  experiment.  Comparing  these  quiddities  with  the  correct 
ones,  we  found  that  few  quiddities  were  correctly  defined — there  were  13 
matches  for  group  1  and  15  matches  for  group  2,  out  of  a  maximum  of  45 
possible  matches  in  each  group.  (These  numbers  do  increase  if  we  allow 
for  the  use  of  synonym  terms  or  for  the  use  of  more  or  less  specific  terms.) 
Comparing quiddities for each pair of subjects in the same group, we found
that few quiddities were identically defined: there were only 10 matches for
group 1 and 7 matches for group 2, out of a maximum of 45 possible matches
in each group. These numbers represent a significant increase over those
obtained in the previous experiment.

Comparisons of individual quiddity components showed that the subjects
were now usually able to correctly identify the stuff but were still not
performing well on the stuff attribute (the performance on the other
categories was again poorer than on these two). Compared with the correct
quiddities, there were 39 stuff matches in group 1, 35 stuff matches in
group 2, 25 attribute matches in group 1, and 27 attribute matches in group 2,
all out of a maximum of 45. Compared within the groups, the numbers were 13,
22, 16, and 13, respectively, again out of a maximum of 45. These numbers
again represent a significant improvement over the results of the previous
experiment. There were very few instances in which subjects interchanged
terms between stuff and attribute in this experiment.

An  examination  of  the  quiddities  developed  by  various  subjects  showed 
that  in  spite  of  the  improvements  over  the  previous  experiment,  there  were 
still  inconsistencies  across  subjects  in  the  specification  of  the  stuff  type,  stuff 
arguments,  and  attribute  type  terms.  These  inconsistencies  reflect  differences 
in  the  specificity  of  terms  chosen  for  quiddity  components  (see  hypothesis  2). 
They  also  reflect  uncertainty  about  the  level  of  detail  required  in  specifying 
a  quiddity  (see  hypothesis  4).  Some  subjects  demonstrated  a  tendency  to  be 
consistently  more  descriptive  than  others,  i.e.,  they  listed  more  terms  for  the 
stuff  type  and  stuff  attribute  type  component  than  other  subjects. 

What do these results say about the usefulness of quiddities in UNV
detection? The percentages of correctly specified quiddities (and quiddity
components) are still fairly low. However, these results were based on the
definition of a quiddity "match" as a strict equivalence of all components,
i.e., procedure P111 was implicitly utilized to determine quiddity equivalence.
Are other procedures more appropriate for determining quiddity equivalence?
Can the inconsistencies in quiddity specification be compensated for by more
sophisticated quiddity equivalence procedures? We now move on to an
examination of these questions.

4.3      Experiment  Results:  Inference  Procedures 

Recall that there were 15 data elements in each database schema and 3
subjects in each of the two groups. There were 5 synonym problems and
3 homonym problems in the two database schemas. We used each of the
12 inference procedures to compare quiddities developed by each of the 9
pairs $(s_1, s_2)$ of subjects with subject $s_j$ belonging to group j.† For
each comparison of pairs of data elements, each procedure determined whether
or not the element names had any naming conflict. Similarly, we used each
procedure to examine naming conflicts using the correct quiddities for the
elements in each database.
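
The size of this across-subject analysis follows directly from the counts
given above. The short sketch below enumerates the comparisons rather than
multiplying the counts, to make the structure explicit. It is illustrative
only: the element labels are placeholders, and the variable names are
assumptions that do not come from the original implementation.

```python
from itertools import product

N_ELEMENTS = 15    # data elements per schema (from the text)
N_SUBJECTS = 3     # subjects per group (from the text)
N_PROCEDURES = 12  # inference procedures (from the text)

# Placeholder labels; the real element names are not needed for counting.
db1_elements = [f"db1_element_{n}" for n in range(N_ELEMENTS)]
db2_elements = [f"db2_element_{n}" for n in range(N_ELEMENTS)]

# Every element of database 1 is compared with every element of database 2.
element_pairs = list(product(db1_elements, db2_elements))

# Every subject of group 1 is paired with every subject of group 2.
subject_pairs = list(product(range(N_SUBJECTS), range(N_SUBJECTS)))

# Each procedure is run on each element pair for each subject pair.
total = len(element_pairs) * len(subject_pairs) * N_PROCEDURES
print(len(element_pairs), len(subject_pairs), total)  # 225 9 24300
```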

4.3.1      Results  of  Procedures:  Examples 

We begin by illustrating the results of selected procedures on a small set of
data elements. Databases 1 and 2 both contained information about courses
offered at the Naval Postgraduate School (NPS). From database 1, consider
elements DPT (designates a department at NPS), PREQ-DPT (code identifying
a prerequisite department), and EMPH-AREA (name of an emphasis area
available to students as an area of study within a particular curriculum).
From database 2, consider the elements DEPT and PREREQ-DEPT (code
identifying a prerequisite department at NPS), which are really synonyms for
DPT and PREQ-DPT respectively. From database 2, also consider elements EMPH
(code identifying the emphasis area), and EMPH-NAME (title of an emphasis
area that students may select at NPS). The quiddities for these elements
are indicated in table 1, and the results (for synonym detection) of applying
procedures P111, P232, and P343 are shown in table 2.

† That results in (15 x 15) x 12 x 9 = 24,300 comparisons for across-subject
analyses.

Data Element   Stuff          Arguments  Stuff Type         Attribute   Attribute Type
------------   -------------  ---------  -----------------  ----------  --------------
DPT            department                NPS                designator
PREQ-DPT       department                prerequisite       identifier
EMPH-AREA      emphasis-area             curriculum         title
DEPT           department                NPS                identifier
PREREQ-DEPT    department                NPS, prerequisite  identifier
EMPH           emphasis-area             NPS                identifier
EMPH-NAME      emphasis-area             NPS                title

Table 1: Examples of Quiddities of selected Data Elements

     Database-1   Database-2    Are they    Detected as Synonyms by
     Element      Element       Synonyms?   P111    P232    P343
---  -----------  ------------  ---------   ----    ----    ----
1    DPT          DEPT          Yes         No      Yes     Yes
2    PREQ-DPT     PREREQ-DEPT   Yes         No      Yes     Yes
3    EMPH-AREA    EMPH-NAME     Yes         No      No      Yes
4    PREQ-DPT     DEPT          No          No      No      No
5    EMPH-AREA    EMPH          No          No      No      Yes
6    PROF-PHONE   SSN           No          No      No      Yes

Table 2: Examples of Synonym Detection using selected Procedures

It would be useful for the reader to examine the data element definitions
(given above) and sample quiddities (table 1) as well as the results of applying
the three procedures to the six pairs of data elements (table 2).

Using Correct Quiddities

Procedure   # Synonyms Found (Hits)   Type 1 Errors   Type 2 Errors
---------   -----------------------   -------------   -------------
111                   1                      0               4
122                   5                      2               2
132                   5                      2               2
143                   5                      2               2
211                   3                      1               3
222                  11                      7               1
232                  11                      7               1
243                  11                      7               1
311                  13                     11               3
322                  31                     27               1
332                  31                     27               1
343                  44                     39               0

Using Subjects' Quiddities

Procedure   # Synonyms Found (Hits)   Type 1 Errors   Type 2 Errors
---------   -----------------------   -------------   -------------
111                   1.0                    0.0             4.0
122                   2.3                    1.0             3.7
132                   2.3                    1.0             3.7
143                   2.7                    1.3             3.7
211                   2.9                    0.7             2.8
222                  11.1                    7.4             1.3
232                  10.8                    7.1             1.3
243                  13.0                    9.3             1.3
311                  10.4                    8.2             2.8
322                  35.4                   31.3             0.9
332                  31.9                   27.8             0.9
343                  46.1                   41.4             0.3

Table 3: Detecting Synonym Problems (Total synonym pairs = 5)
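
To make the contrast between a strict and a more tolerant equivalence test
concrete, the sketch below applies two matchers to the sample quiddities of
table 1. It rests on several assumptions that should be stated plainly: the
three terms listed for each element are read as stuff, stuff type, and
attribute; the terms designator and identifier are assumed to be declared
equivalent; and relaxed_match is a hypothetical rule written for this
illustration, not the definition of any procedure in this paper. Only
strict_match, which requires all components to be identical, corresponds to
the description of P111 given in §4.2.

```python
from typing import FrozenSet, NamedTuple


class Quiddity(NamedTuple):
    stuff: str
    stuff_type: FrozenSet[str]
    attribute: str


def q(stuff, stuff_type, attribute):
    return Quiddity(stuff, frozenset(stuff_type), attribute)


# Sample quiddities from table 1 (column reading is an assumption).
DB1 = {
    "DPT": q("department", {"NPS"}, "designator"),
    "PREQ-DPT": q("department", {"prerequisite"}, "identifier"),
    "EMPH-AREA": q("emphasis-area", {"curriculum"}, "title"),
}
DB2 = {
    "DEPT": q("department", {"NPS"}, "identifier"),
    "PREREQ-DEPT": q("department", {"NPS", "prerequisite"}, "identifier"),
    "EMPH": q("emphasis-area", {"NPS"}, "identifier"),
    "EMPH-NAME": q("emphasis-area", {"NPS"}, "title"),
}

# Assumed classes of equivalent vocabulary terms (hypothetical).
TERM_SYNONYMS = [{"designator", "identifier"}]


def same_term(a: str, b: str) -> bool:
    return a == b or any(a in s and b in s for s in TERM_SYNONYMS)


def strict_match(a: Quiddity, b: Quiddity) -> bool:
    """All components must be identical."""
    return a == b


def relaxed_match(a: Quiddity, b: Quiddity) -> bool:
    """Hypothetical rule: equivalent terms allowed for stuff and attribute,
    and one stuff type description may be more detailed than the other."""
    nested = a.stuff_type <= b.stuff_type or b.stuff_type <= a.stuff_type
    return (same_term(a.stuff, b.stuff)
            and same_term(a.attribute, b.attribute)
            and nested)


PAIRS = [("DPT", "DEPT"), ("PREQ-DPT", "PREREQ-DEPT"),
         ("EMPH-AREA", "EMPH-NAME"), ("PREQ-DPT", "DEPT"),
         ("EMPH-AREA", "EMPH")]

strict = [strict_match(DB1[a], DB2[b]) for a, b in PAIRS]
relaxed = [relaxed_match(DB1[a], DB2[b]) for a, b in PAIRS]
for (a, b), s, r in zip(PAIRS, strict, relaxed):
    print(f"{a:10} {b:12} strict={s!s:5} relaxed={r}")
```

Under these assumptions the strict matcher reports no synonyms among the
first five pairs of table 2, and the relaxed matcher reports the first two
pairs only. The sixth pair (PROF-PHONE, SSN) is omitted because its
quiddities are not listed in table 1.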

4.3.2      Synonym  Detection 

The results (for detecting synonym problems) of applying these procedures
to the correct quiddities, and to quiddities developed in the experiment (the
numbers represent an average over 9 comparisons) are shown in table 3.

We draw several interesting observations from these results. First, we
note that as we vary the i of Pijk from 1 to 2 there is an increase in the
number of "hits", a decrease in type 2 errors (only 1, or 1.3, for procedures
P222, P232, and P243), and not much of an increase in type 1 errors. This
happens since these procedures allow subjects to use alternate equivalent
terms (i.e., synonyms such as price and cost) for each quiddity component. As
we vary i from 2 to 3, however, there is a huge increase in the number of hits
and in type 1 errors. (There is not much scope for reduction of type 2 errors,
though.) While not desirable, this is consistent with hypothesis 2. Second,
as we vary j from 1 to 2 or 3, there is a decrease in type 2 errors, consistent
with hypotheses 3 and 4. Similarly, as we vary k from 1 to 2, there is a
decrease in type 2 errors, again consistent with hypothesis 4. Third, setting
j to 4 (mixing stuff with stuff type and stuff arguments) and k to 3 (mixing
attribute with attribute type) is not too useful since it leads to a huge
increase in type 1 errors. This is consistent with hypothesis 1, which asserts
that stuff and attribute are the most significant components.

In terms of the relative weight $w$ (= $w_2 / w_1$) we found that procedures
P222 and P232 were the best procedures‡ (had the lowest $E_w$) for a wide
range, $1 < w < 25$, of $w$ values. Only for $w > 25$ ($w > 32$ in the case
of the correct quiddities) does procedure P343 become more attractive. (Note
that values of $w$ less than 1 are not meaningful.) This is a significant
range, and suggests that P222 and P232 might be the best procedures to use
for UNV detection. These procedures failed to detect one synonym problem
(EMPH-AREA, EMPH-NAME), but that, we found, was an unusual case. It turned
out that even though these elements referred to the same concept, their
definitions in the data dictionary were quite different (see §4.3.1), and led us
to include the stuff type term curriculum in one case, and NPS in another.
These two procedures did succeed in pruning the detection problem from 225
pairs of data elements (15 x 15) to 11 pairs (number of hits). These results
indicate the usefulness of quiddities in detection of synonym problems.
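
The threshold quoted for the correct quiddities can be checked from the
counts in table 3. With $w_1 = 1$, two procedures have equal weighted error
at the weight where the extra type 1 errors of one exactly offset the extra
type 2 errors of the other. The sketch below computes that weight for P222
and P343. The function names are assumptions introduced for illustration,
and the calculation uses only the correct-quiddity rows of table 3.

```python
def weighted_error(type1, type2, w):
    """Weighted error rate with w_1 = 1 and w_2 = w."""
    return type1 + w * type2


def break_even_weight(a, b):
    """Weight w at which procedures a and b have equal weighted error.

    a and b are (type 1 errors, type 2 errors) pairs. Returns None when
    both commit the same number of type 2 errors, since w then has no
    effect on their ranking.
    """
    (a1, a2), (b1, b2) = a, b
    if a2 == b2:
        return None
    # Solve a1 + w * a2 == b1 + w * b2 for w.
    return (b1 - a1) / (a2 - b2)


# (type 1 errors, type 2 errors) with the correct quiddities, table 3.
P222 = (7, 1)
P343 = (39, 0)

w_star = break_even_weight(P222, P343)
print(w_star)  # 32.0
print(weighted_error(*P222, w_star), weighted_error(*P343, w_star))
```

Below this weight P222 has the lower weighted error of the two, and above it
P343 does, which agrees with the value of 32 stated in the text for the
correct quiddities.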


‡ So was P243, but it is difficult to understand why it should be so, in
general.

4.3.3     Homonym Detection

Since our research focused primarily on the synonym problem, and since that
is the harder one, we will address the homonym problem only briefly. In short,
our procedures did an excellent job of detecting the 3 homonym problems.
There were no type 1 errors for any procedure, and only procedures P3jk had
type 2 errors when the correct quiddities were used. With the quiddities
obtained in the experiment, procedures P111 and P211 each had an average of
0.2 type 2 errors, and several other procedures had an average of less than
1. It seems logical to conclude that the stricter procedures P111 and P211 are
best suited to the detection of homonym problems.
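
One way to see why stricter procedures suit homonym detection is that a
homonym problem is reported when two elements share a name but their
quiddities are judged not equivalent, so a tolerant equivalence test is the
one that can miss it. The sketch below is illustrative only. The element
name CODE, both quiddities, and the loose matcher are hypothetical and are
not taken from the experiment or from the procedures defined in this paper.

```python
def is_homonym_problem(name1, quiddity1, name2, quiddity2, equivalent):
    """Same name in both schemas, but the quiddities are not equivalent."""
    return name1 == name2 and not equivalent(quiddity1, quiddity2)


def strict_equivalent(q1, q2):
    """All components identical."""
    return q1 == q2


def loose_equivalent(q1, q2):
    """Hypothetical tolerant test that compares only the attribute."""
    return q1["attribute"] == q2["attribute"]


# Hypothetical element CODE, declared in two schemas with different meanings.
code_in_db1 = {"stuff": "course", "attribute": "identifier"}
code_in_db2 = {"stuff": "department", "attribute": "identifier"}

found_strict = is_homonym_problem(
    "CODE", code_in_db1, "CODE", code_in_db2, strict_equivalent)
found_loose = is_homonym_problem(
    "CODE", code_in_db1, "CODE", code_in_db2, loose_equivalent)

print(found_strict)  # True: the strict test exposes the homonym
print(found_loose)   # False: the tolerant test commits a type 2 error
```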

5      Conclusions 

The basic principle underlying our strategy for detecting naming conflicts
is that this process must rely on semantic information about the data
elements in the database. We believe that the concept of quiddities, as defined
in Bhargava et al. [4] and as refined in this paper, can effectively capture
the semantic information necessary for the detection of these conflicts. Our
experiments, though conducted on a small scale, do provide evidence to
support this belief. We found that database users could be trained to declare
quiddity information accurately enough that it could be used by inference
procedures to detect naming conflicts. Certain of our inference procedures
performed reasonably well in detecting these conflicts. However, much more
testing needs to be done before any general conclusions can be reached from
these results.

There are several ways to obtain higher accuracy and consistency in
quiddity specification. One is to refine the definitions of the quiddity
categories and leave the burden on the user to develop correct and more
accurate quiddities. The second is to shift the burden to the inference
procedures, by defining sophisticated procedures that take into account
inconsistencies such as differences in the level of detail or specificity. Our
approach is a combination of these two, but emphasizes the latter. A third is
to develop interactive, automated tools for supporting the quiddity
acquisition process. All of these alternatives, and particularly the last one,
are issues for further research. The "quiddities approach" for the detection
of naming conflicts is fundamentally different from other approaches in its
use of formalized semantic information in conjunction with automated
inference procedures. It would be interesting to examine if and how this
strategy could be applied to other aspects of database integration as well.